Suppose you want to buy \(x\) shares of TGT stock. The total cost \(y\), in dollars, is an exact linear function of \(x\):
\[ y = 5 + 64.96 x. \]
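As a quick illustration of how the formula is used, take \(x = 10\) shares (a number chosen here only as an example) and read \(y\) as the total cost in dollars, which is an assumption about the intended setup:
\[ y = 5 + 64.96(10) = 5 + 649.60 = 654.60 \]
Every additional share adds exactly 64.96 to \(y\), so all such points fall perfectly on one line.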
A linear relationship between an explanatory variable \(x\) and a response variable \(y\) can be estimated by a regression line:
\[ \hat{y} = b_0 + b_1 x \]
We use the symbol \(\hat{y}\) to emphasize that this is a predicted value of \(y\).
Researchers captured 104 brushtail possums and took body measurements before releasing the animals back into the wild. We consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum’s head.
The equation of the regression line for predicting Head Length (in mm) from Total Length (in cm) is \(\hat{y} = 41 + 0.59x\).
Use the regression model to predict the Head Length of a possum whose Total Length is 76.0 cm.
One of the possums in the sample had a Total Length of 76.0 cm and a Head Length of 85.1 mm. Did the model prediction overestimate or underestimate the Head Length? By how much?
Residuals are the leftover variation in the data after accounting for the model fit.
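In symbols, the residual of an observation is its observed value minus its predicted value, \(e_i = y_i - \hat{y}_i\). A worked sketch for the possum described above, assuming \(x\) is Total Length in cm and \(y\) is Head Length in mm:
\[ \begin{align} \hat{y} &= 41 + 0.59(76.0) = 41 + 44.84 = 85.84 \\ e &= y - \hat{y} = 85.1 - 85.84 = -0.74 \end{align} \]
A negative residual means the observed value lies below the line, so here the model overestimates the Head Length by about 0.74 mm.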
Everyone (especially the Operator): open the following applet:
https://www.rossmanchance.com/applets/2021/regshuffle/regshuffle.htm
1. Click the Show Regression Line checkbox and record the equation of the regression line.
2. Click Show Residuals. Find the largest negative residual and click on its point in the scatterplot. Record the value of the residual.
3. Delete this row by clicking the Delete button that appeared in part 2. How did this affect the slope of the regression line?
The correlation coefficient \(r\) describes the strength and direction of a linear relationship.
| Range of \(\lvert r \rvert\) | Strength | Meaning |
|---|---|---|
| \(0.7 \leq \lvert r \rvert \leq 1\) | Strong | Points almost form a line. |
| \(0.3 \leq \lvert r \rvert < 0.7\) | Moderate | Clear pattern, but bloblike. |
| \(0.1 \leq \lvert r \rvert < 0.3\) | Weak | Slight pattern. |
| \(0 \leq \lvert r \rvert < 0.1\) | None | No discernible trend. |
A: \(r = -0.54\)
B: \(r = 0.16\)
C: \(r = 0.46\)
D: \(r = -0.44\)
E: \(r = 0.69\)
F: \(r = 0.85\)
Return to the correlation applet of the previous group exercise.
Reload the applet to get it back to the original state.
Check Show Regression Line and Correlation coefficient. Record the value of \(r\).
Now try deleting points from the scatterplot and see if you can get the value of \(r\) to be greater than 0.85. How many points did you have to delete?
Reload the applet.
Under Explore Lines, check show movable line and show squared residuals.
Move the line around and try to make SSE as small as possible. (SSE is the sum of the squared residuals, i.e., the total area of all those squares.)
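Written as a formula, for a candidate line \(\hat{y} = b_0 + b_1 x\) and data points \((x_1, y_1), \dots, (x_n, y_n)\), the quantity being minimized is:
\[ \text{SSE} = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^n \left( y_i - b_0 - b_1 x_i \right)^2 \]
The least-squares regression line is the choice of \(b_0\) and \(b_1\) that makes this sum as small as possible, so no position of the movable line should give a smaller SSE than the regression line does.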
Rows: 50
Columns: 3
$ family_income <dbl> 92.922, 0.250, 53.092, 50.200, 137.613, 47.957, 113.534, 168.579, 208.115, 1…
$ gift_aid <dbl> 21.720, 27.470, 27.750, 27.220, 18.000, 18.520, 13.000, 13.000, 14.000, 25.4…
$ price_paid <dbl> 14.280, 8.530, 14.250, 8.780, 24.000, 23.480, 23.000, 29.000, 28.000, 16.530…
Call:
lm(formula = gift_aid ~ family_income, data = elmhurst)
Residuals:
Min 1Q Median 3Q Max
-10.1128 -3.6234 -0.2161 3.1587 11.5707
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.31933 1.29145 18.831 < 2e-16 ***
family_income -0.04307 0.01081 -3.985 0.000229 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.783 on 48 degrees of freedom
Multiple R-squared: 0.2486, Adjusted R-squared: 0.2329
F-statistic: 15.88 on 1 and 48 DF, p-value: 0.0002289
\[ \widehat{\texttt{gift_aid}} = 24.3 - 0.0431 \times \texttt{family_income} \]
Both variables are recorded in thousands of dollars, so the slope of \(-0.0431\) means that for each additional \(\$1,000\) of family_income, the predicted amount of gift_aid decreases by about \(\$43\).
Evaluating the fitted model at a family_income of 80 (thousands of dollars) gives 20.8736. The predicted gift_aid for a family_income of \(\$80,000\) is about \(\$20,874\).
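The arithmetic behind this prediction can be sketched with the coefficient estimates printed in the summary output, with family_income entered in thousands of dollars:
\[ \widehat{\texttt{gift_aid}} = 24.31933 - 0.04307(80) = 24.31933 - 3.4456 \approx 20.87 \]
The final digits can differ slightly from the software value because the printed coefficients are rounded.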
Evaluating the same model at a family_income of 3000 (thousands of dollars) gives \(-104.8956\). The predicted gift_aid for a family_income of \(\$3,000,000\) is negative!
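The same calculation with the coefficient estimates from the summary output shows where the negative value comes from:
\[ \widehat{\texttt{gift_aid}} = 24.31933 - 0.04307(3000) = 24.31933 - 129.21 \approx -104.89 \]
An income of 3000 (in thousands) lies far beyond the incomes visible in the data preview, so this is an extrapolation. A line fitted to one range of incomes need not describe what happens well outside that range.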
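About 25% (\(R^2 = 0.2486\)) of the variation in gift_aid can be explained by the linear relationship with family_income.
The correlation coefficient is computed from the sample as:
\[ r= \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i -\bar{y}}{s_y} \right) \]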
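In simple linear regression, \(R^2\) is the square of the correlation coefficient, so the Multiple R-squared in the summary output also determines \(r\) for these data. The sign is taken from the negative slope:
\[ r = -\sqrt{0.2486} \approx -0.499 \]
By the strength table given earlier, this is a moderate negative linear relationship.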
\[ \begin{align} \hat{y} &= a + bx && \text{regression line} \\ b &= r \frac{s_y}{s_x} && \text{sample slope} \\ a &= \bar{y} - b\bar{x} && \text{sample $y$-intercept} \\ \end{align} \]
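The formulas for \(r\), \(b\), and \(a\) can be checked numerically. The following is a minimal R sketch, not part of the original analysis: the five data points are hypothetical and invented only for illustration, and the names `z_x`, `z_y`, and `fit` are choices made here. It computes the three quantities by hand and compares them with R's built-in `cor()` and `lm()`.

```r
# Hypothetical toy data, invented for illustration (NOT the elmhurst data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Correlation: sum of products of z-scores, divided by n - 1
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r <- sum(z_x * z_y) / (n - 1)

# Slope and intercept of the regression line from r, the sds, and the means
b <- r * sd(y) / sd(x)
a <- mean(y) - b * mean(x)

# Compare the hand computations with R's built-in functions
fit <- lm(y ~ x)
print(c(r = r, cor = cor(x, y)))
print(c(a = a, b = b))
print(coef(fit))
```

For these toy values, the hand-computed intercept and slope should agree with `coef(fit)`, and `r` should agree with `cor(x, y)`, up to floating-point rounding.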
Reload the regression applet.
Click on Show Data Options and Show Regression Line.
Add the point \((38, 75)\). Did the regression line change?
Add the point \((38,36)\). Did the regression line change?
Now click on Move Observations. Try moving that last observation around, and observe what happens to the regression line.
Discuss: What types of observations are most influential on the regression line?